Copula Processes 



o 



Andrew Gordon Wilson 
Zoubin Ghahramani 
Department of Engineering 
University of Cambridge 
Cambridge, UK CB2 IPZ 
^ agw38(§ c am . ac . uk 

^ zoubin@eng.caiii.ac.uk 

o^ 

Abstract 

We define a copula process which describes the dependencies between arbitrarily many random 
variables independently of their marginal distributions. As an example, we develop a stochastic 
volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard 

5^ deviations of a sequence of random variables. To make predictions we use Bayesian inference, 

with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We 
I— ' find both methods comparable. We also find our model can outperform GARCH on simulated 

and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate 

^ covariates other than time, and model a rich class of covariance structures. 

o 
in 

^ 1 Introduction 

^ . . . 

Imagine measuring the distance of a rocket as it leaves Earth, and wanting to know how these measurements 
correlate with one another. How much does the value of the measurement at fifteen minutes depend on the 
measurement at five minutes? Once we've learned this correlation structure, suppose we want to compare 
^ it to the dependence between measurements of the rocket's velocity. To do this, it is convenient to separate 

dependence from the marginal distributions of our measurements. At any given time, a rocket's distance 
rS from Earth could have a Gamma distribution, while its velocity could have a Gaussian distribution. And 

separating dependence from marginal distributions is precisely what a copula function does. 

While copulas have recently become popular, especially in financial applications [T|[nj, as Nelsen |3] writes, 
"the study of copulas and the role they play in probability, statistics, and stochastic processes is a subject 
still in its infancy. There are many open problems. . . " Typically only bivariate (and recently trivariate) 
copulas are being used and studied. In our introductory example, we are interested in learning the correla- 
tions in different stochastic processes, and comparing them. It would therefore be useful to have a copula 
process, which can describe the dependencies between arbitrarily many random variables independently of 
their marginal distributions. We define such a process. As an example, we develop a stochastic volatility 
model, Gaussian Copula Process Volatility (GCPV) . In doing so, we provide a Bayesian framework for the 
learning the marginal distributions and dependency structure of what we call a Gaussian copula process. 

The volatility of a random variable is its standard deviation. Stochastic volatility models are used to predict 
the volatilities in a heteroscedastic sequence - a sequence of random variables with different variances, like 
distance measurements of a rocket as it leaves the Earth. As the rocket gets further away, the variance 
on the measurements increases. Heteroscedasticity is especially important in econometrics; the returns on 
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equity indices, like the S&P 500, or on currency exchanges, are heteroscedastic. Indeed, in 2003, Robert 
Engle won the Nobel Prize in economics "for methods of analyzing economic time series with time- varying 
volatility". GARCH [4i, a generalized version of Engle's ARCH, is arguably unsurpassed for predicting the 
volatility of returns on equity indices and currency exchanges |5l[6l[7]. GCPV can outperform GARCH, 
and is competitive on financial data that especially suits GARCH [H IHl HO]- Before introducing GCPV, 
we first discuss copulas and then introduce our copula process. For a review of Gaussian processes, see 
Rasmussen and Williams [llj . 

2 Copulas 

Copulas are important because they separate the dependency structure between random variables from 
their marginal distributions. Intuitively, we can describe the dependency structure of any multivariate joint 
distribution H{xi, . . . , a;„) = P{Xi < xi, . . . X„ < Xn) through a two step process. First we take each 
univariate random variable Xi and transform it through its cumulative distribution function (cdf) Fi to 
get Ui = Fi{Xi), a uniform random variable. We then express the dependencies between these transformed 
variables through the n-copula C(ui, . . . Formally, an n-copula C : [0, l]" [0, 1] is a multivariate 

cdf with uniform univariate marginals: C(ui, U2, . . . , m„) = P{Ui < ui,C/2 < U2,---,Un < u„), where 
Ui,U2, ■ ■ ■ ,Un are standard uniform random variables. Sklar |12j precisely expressed our intuition in the 
theorem below. 

Theorem 2.1. Sklar's theorem 

Let H he an n- dimensional distribution Junction with marginal distribution functions Fi, F2, . . . , Fn. Then 
there exists an n-copula C such that for all (xi, X2, . . . , Xn) G [—00, 00]", 

H{xi,X2, . . .,Xn) = C{Fi{xi),F2{x2), ■ ■ . , Fn{Xn)) = C(mi,U2, ■ • ■,Un). (1) 

If Fi, F2, ■ ■ ■ , Fn are all continuous then C is unique; otherwise C is uniquely determined on Range -Fi x 
Range -F2 x ••• x Range Conversely, if C is an n-copula and Fi, F2, . . . , F^ are distribution func- 
tions, then the function H is an n-dimensional distribution function with marginal distribution functions 
Fi, F2, ■ . ■ , Fn- 

As a corollary, if F^ ^' (u) — infja; : F(x) > u}, the quasi-inverse of Fi, then for all Mi, W2, . . . , u„ £ [0, 1]", 
C{ui,U2, . . . , u„) = HiFi-'\ui),Ft'\u2), . . . , Fi-'\u„)). (2) 

In other words, ([2| can be used to construct a copula. For example, the bivariate Gaussian copula is 
defined as 

CK«) = $p($-l(w),<i>-i(t;)), (3) 

where $p is a bivariate Gaussian cdf with correlation coefficient p, and $ is the standard univariate 
Gaussian cdf. Li [2j popularised the bivariate Gaussian copula, by showing how it could be used to study 
financial risk and default correlation, using credit derivatives as an example. 

By substituting F{x) for u and G{y) for v in equation ([3]), we have a bivariate distribution H{x, y), with a 
Gaussian dependency structure, and marginals F and G. Regardless of F and G, the resulting H{x, y) can 
still be uniquely expressed as a Gaussian copula, so long as F and G are continuous. It is then a copula 
itself that captures the underlying dependencies between random variables, regardless of their marginal 
distributions. For this reason, copulas have been called dependence functions [131 114j . Nelsen [3] contains 
an extensive discussion of copulas. 
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3 Copula Processes 



Imagine choosing a covariance function, and then drawing a sample function at some finite number of 
points from a Gaussian process. The result is a sample from a collection of Gaussian random variables, 
with a dependency structure encoded by the specified covariance function. Now, suppose we transform 
each of these values through a univariate Gaussian cdf, such that we have a sample from a collection of 
uniform random variables. These uniform random variables also have this underlying Gaussian process 
dependency structure. One might call the resulting values a draw from a Gaussian- Uniform process. We 
could subsequently put these values through an inverse beta cdf, to obtain a draw from what could be 
called a Gaussian-Beta process: the values would be a sample from beta random variables, again with an 
underlying Gaussian process dependency structure. Alternatively, we could transform the uniform values 
with different inverse cdfs, which would give a sample from different random variables, with dependencies 
encoded by the Gaussian process. 

The above procedure is a means to generate samples from arbitrarily many random variables, with ar- 
bitrary marginal distributions, and desired dependencies. It is an example of how to use what we call a 
copula process - in this case, a Gaussian copula process, since a Gaussian copula describes the underlying 
dependency structure of a finite number of samples. We can now formally define a copula process. 

Definition 3.1. Copula Process 

Let {Wt\ he a collection of random variables indexed by t G T, with marginal distribution functions Ft, 
and let Qt = Ft(Wt). Further, let ^ be a stochastic process measure with marginal distribution functions 
Gt, and joint distribution function H. Then Wt is copula process distributed with base measure fj,, or 
Wt ^ CV{fJ,), if and only if for all n £ N, G M, 

n 

P{[]{g''u^\Qu) < aJ) = ^ti.t......t„(ai, a2, ■ • • , a„). (4) 

1=1 

Note that each Qt- ~ Uniform(0, 1), and that g[. '^'^ is the quasi-inverse of Gj. , as it was previously defined. 
Definition 3.2. Gaussian Copula Process 

Wt is Gaussian copula process distributed if it is copula process distributed and the base measure ^ is a 
Gaussian process. If there is a mapping ^ such that ^{Wt) ~ QV{m{t),k(t,t')), then we write Wt ~ 

gcr{^,m{t),k{t,t')). 

For example, if we have Wt ~ QCV with m{t) = and k{t,t) = 1, then in our definition of a copula 
process, Gt — the standard univariate Gaussian cdf, and H is the usual GP joint distribution function. 
Supposing this GCP is a Gaussian-Beta process, then = <I>^^ o Fb, where Fb is a univariate Beta cdf. 
One could similarly define other copula processes. 

We described generally how a copula process can be used to generate samples of arbitrarily many random 
variables with desired marginals and dependencies. We now develop a specific and practical application of 
this framework. We introduce a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), 
as an example of how to learn the joint distribution of arbitrarily many random variables, the marginals of 
these random variables, and to make predictions. To do this, we fit a Gaussian copula process by using a 
type of Warped Gaussian Process [T^. However, our methodology varies substantially from Snelson et al. 
|15] . since we are doing inference on latent variables as opposed to observations, which is a much greater 
undertaking that involves approximations, and we are doing so in a different context. 
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4 Gaussian Copula Process Volatility 



Assume we have a sequence of observations y — (j/i, . . . , yn)^ at times i = (ti, . . . , The observations 

are random variables with different latent standard deviations. We therefore have n unobserved standard 
deviations, cti, . . . , cr„, and want to learn the correlation structure between these standard deviations, and 
also to predict the distribution of cr* at some unrealised time i^. 

We model the standard deviation function as a Gaussian copula process: 

(Tt^gCV{g~\Q,k{t,t')). (5) 

Specifically, 

f ^gV{m[t)^Q,k{t.i')) (6) 
a(t)=g(/(t),a;) (7) 

y\t^N{Q,n^{t)), (8) 

where is a monotonic warping function, parametrized by u). For each of the observations y = (j/i, . . . , j/n)^ 
we have corresponding GP latent function values / = (A, • ■ • , /n)^, where a{ti) = g{fi,u)), using the 
shorthand fi to mean f{ti). 

Ct ^ QCV , because any finite sequence (ai, . . . , Up) is distributed as a Gaussian copula: 

P{ax<ai,...,ap<ap)^P{g'\ai)<g'\ax),...,g-\cjp)<g-\ap)) (9) 

= $r(5-'(«i), . . . ,5"'(ap)) = $r(<i>-'(F(ai)), • • • , <^-\F{ap))) 

= $r($"'("i), . ■ . , = . . . 

where $ is the standard univariate Gaussian cdf (supposing k{t,t) — 1), $r is a multivariate Gaussian 
cdf with covariance matrix Tij — cav{g~^{ai),g~^{aj)), and F is the marginal distribution of each ai. In 
(|5|, we have 5' = because it is g^^ which maps at to a GP. The specification in ^ is equivalently 
expressed by ([6| and ([t]). With GCPV, the form of g is learned so that g^^{at) is best modelled by a GP. 
By learning g, we learn the marginal of each a: F{a) = ^{g~^{a)) for a g M. Recently, a different sort of 
'kernel copula process' has been used, where the marginals of the variables being modelled are not learned 
|16j . Further, we also consider a more subtle and flexible form of our model, where the function g itself 
is indexed by time: g = gt{f{t),uj). We only assume that the marginal distributions of at are stationary 
over 'small' time periods, and for each of these time periods (5)-(7) hold true. We return to this in the 
final discussion section. 

Here we have assumed that each observation, conditioned on knowing its variance, is normally distributed 
with zero mean. This is a common assumption in heteroscedastic models. The zero mean and normality 
assumptions can be relaxed and are not central to this paper. 



5 Predictions with GCPV 

Ultimately, we wish to infer p{a{t^,)\y, z), where z = {9,u}}, and 6 are the hyperparameters of the GP 
covariance function. To do this, we sample from 

p{f*\y,z) = I p{Mf,e)p{f\y,z)df (10) 

and then transform these samples by g. Letting (Cf)ij ~ dijg{fi,u})^ , where Sij is the Kronecker delta, 
Kij — k{ti,tj), (fc*)i = k{t^,,ti), we have 

p{f\y, z) = A/-(/; 0, K)N{y; 0, Cf)lp{y\z), (11) 
p{Mf,9)^M{kjK-^f,k(U,t,)-kjK-^k,). (12) 
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We also wish to learn 2;, which wc can do by finding the z that maximizes the marginal likelihood, 



V{y\z)^ I p{y\f,iv)p{f\e)df. (13) 



Unfortunately, for many functions g, ( 10 ) and ( 13 1 are intractable. Our methods of dealing with this 



can be used in very general circumstances, where one has a Gaussian process prior, but an (optionally 
parametrized) non-Gaussian likelihood. We use the Laplace approximation to estimate p{f\y,z) as a 



Gaussian. Then we can integrate (10) for a Gaussian approximation to p[f^,\y,z), which we sample from 
to make predictions of cr*. Using Laplace, we can also find an expression for an approximate marginal 
likelihood, which we maximize to determine z. While we always use Laplace to determine z, we compare 
to a full Laplace solution by also using Markov chain Monte Carlo to sample from p{f*\y, z). 

Let us now relate the above to the Gaussian copula in ([9|. The prior Tij = cav{g^^{ai),g^^{aj)) — 
cov(/i, fj) — k(ti, tj). The posterior Fy can be estimated as the covariance matrix of the Laplace approx- 
imation for p{f\y)- Also, since each component of / is transformed separately, such that cr(ii) = g{f{ti)), 
we have 

p{a\y, z)^[l[ £]pif\y, z)=[ll ^^^]pif\y, z). (14) 
One can use this to simulate from the joint distribution over the deviations. 



5.1 Laplace Approximation 



The goal is to approximate (11) with a Gaussian, so that we can evaluate (10 1 and (13 1 and make pre- 



dictions. In doing so, we follow Rasmussen and Williams [TT] in their treatment of Gaussian process 
classification, except we use a parametrized likelihood, and modify Newton's method. 



First, consider as an objective function the logarithm of an unnormalized (111 

sif\y,z) ^\ogp{y\f,uj) +\ogp{f\e). 



(15) 



The Laplace approximation uses a second order Taylor expansion about the / which maximizes (15), to 
find an approximate objective s{f\y, z). So the first step is to find /, for which we use Newton's method. 

(VVs(/))-iVs(/). Differentiating 



The Newton update is /" 



VVs{f\y,z) 



Ws{f\y,z)=Wlogp{y\f,u:)- 
VVlogp(y|/,a;) - R-^ = -W 



(16) 
(17) 



where W is the diagonal matrix — VV logp(y|/, a;). 

If the likelihood function p{y\f, u>) is not log concave, then W may have negative entries. Vanhatalo et al. 
|17j found this to be problematic when doing Gaussian process regression with a Student-t likelihood. 
They instead use an expectation-maximization (EM) algorithm for finding /, and iterate ordered rank one 
Cholesky updates to evaluate the Laplace approximate marginal likelihood. But EM can converge slowly, 
especially near a local optimum, and each of the rank one updates is vulnerable to numerical instability. 
With a small modification of Newton's method, we often get close to quadratic convergence for finding 
/, and can evaluate the Laplace approximate marginal likelihood in a numerically stable fashion, with no 
approximate Cholesky factors, and optimal computational requirements. 

At a maximum, the negative Hessian of the objective function, W + K^^ , is positive definite. On each 
iteration of Newton's method, we form M by setting all negative entries of W to zero. Since is 
positive definite, and the eigenvalues of M -I- are greater than or equal to the eigenvalues of , 
M -\- is always positive definite. Using M in place of W decreases the Newton step size, and changes 
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the direction of steps. We are always stepping towards a local maximum, and will converge, barring rare 
pathologies. 

Furthermore, we reformulate our optimization in terms of B = I + M^KM^, which is often well 
conditioned: it has eigenvalues no smaller than 1, and no larger than 1 + n maxjj(i^jj)/4. Letting 
Q = M^B~^M^2^ we find {K^^ + M)~^ ^ K - KQK, and the Newton update becomes /"^"^ = Ka, 
where a = b QKb, and b — Mf + Vlogp(y|/). With these Newton updates we find / and get an 



expression for s which we use to approximate and (f3). 



The approximate marginal likelihood is given by J exp(s)(i/. Taking its logarithm, 

\ogq{y\z) = -\paf + logp(y|/) - \ log \Bfl (f8) 
where B^ \s B evaluated at /, and is a numerically stable evaluation of K~^f. 



To learn the parameters z, we use conjugate gradient descent to maximize (18 1 with respect to z. Since 
/ is a function of z, we initialize z, and update / every time we vary z. Once we have found an optimum 
z, we can make predictions. By exponentiating s, we find a Gaussian approximation to the posterior 



( 11 ), q{f\y, z) — J\f{f, K — KQK). The product of this approximate posterior with |/) is Gaussian. 



Integrating this product, we approximate p(f^\y,z) as 

q{My, z) = AA(fe7v logp(y|/), k{U,U) - kjQK). (19) 

Given n training observations, the cost of each Newton iteration is dominated by computing chol{B), 
which takes 0{n^) operations. The objective function typically changes by less than 10~^ after 3 iterations. 
Once Newton's method has converged, it takes only 0(1) operations to draw from q{f^,\y,z) and make 
predictions. 

5.2 Markov chain Monte Carlo 



We use Markov chain Monte Carlo (MCMC) to sample from (11), so that we can later sample from 



p{a^\y,z) to make predictions. Sampling from (11) is difficult, because the variables / are strongly 



coupled by a Gaussian process prior. We use a new technique, Elliptical Slice Sampling jlSj . and find it 
extremely effective for this purpose. It was specifically designed to sample from posteriors with correlated 
Gaussian priors. It has no free parameters, and jointly updates every element of /. For our setting, it is 
over 100 times as fast as axis aligned slice sampling with univariate updates. 



To make predictions, we take J samples of p{f\y, z), {/^, . . . , f"^}, and then approximate ( 10 ) as a mixture 
of J Gaussians: 

1 

JE^'(/*l/^^)■ (20) 

i=l 

Each of the Gaussians in this mixture have equal weight. So for each sample of f*\y, we uniformly choose 
a random p{f*\f^ , 9) and draw a sample. In the limit J — > oo, we are sampling from the exact p{f* |y, z). 
Mapping these samples through g gives samples from p{a^.\y,z). 

After one 0{n^) and one 0{J) operation, a draw from (20 1 takes 0(1) operations. 



5.3 Warping Function 

The warping function, g, maps /j, a GP function value, to (7^, a standard deviation. Since fi can take any 
value in M, and ai can take any non- negative real value, 17: M — > M+. For each fi to correspond to a unique 
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deviation, g must also be one-to-one. We use 



K 

g{x, w) = ^ flj log[exp[6j(x -I- Cj)] + 1], aj,bj > 0. (21) 
This is monotonia, positive, infinitely difFerentiable, asymptotic towards zero as a; — )■ —oo, and tends to 



{"^j^i ajbj)x as a; — oo. In practice, it is useful to add a small constant to (21 1, to avoid rare situations 
where the parameters u) are trained to make g extremely small for certain inputs, at the expense of a good 
overall fit; this can happen when the parameters a; are learned by optimizing a likelihood. A suitable 
constant could be one tenth the absolute value of the smallest nonzero observation. 

By inferring the parameters of the warping function, or distributions of these parameters, we are learning a 
transformation which will best model at with a Gaussian process. The more flexible the warping function, 
the more potential there is to improve the GCPV fit - in other words, the better we can estimate the 
'perfect' transformation. To test the importance of this flexibility, we also try a simple unparametrized 
warping function, 17(2;) = e^. In related work, Goldberg et al. [12] place a GP prior on the log noise level 
in a standard GP regression model on observations, except for inference they use Gibbs sampling, and a 
high level of 'jitter' for conditioning. 

Once g is trained, we can infer the marginal distribution of each a: F{a) = ^{g^^(a)), for a G M. This 
suggests an alternate way to initialize g: we can initialize -F as a mixture of Gaussians, and then map 
through to find g"^- Since mixtures of Gaussian distributions are dense in the set of probability 
distributions, we could in principle find the 'perfect' g using an infinite mixture of Gaussians [20]. 



6 Experiments 

In our experiments, we predict the latent standard deviations cr of observations y at times t, and also 
cr^ at unobserved times f*. To do this, we use two versions of GCPV. The first variant, which we 



simply refer to as GCPV, uses the warping function (21 ) with K — 1, and squared exponential covariance 
function, k{t,t') = Acxp{—{t — t')'^/P), with A = 1. The second variant, which we call GP-EXP, uses 
the unparametrized warping function e^, and the same covariance function, except the amplitude ^ is a 
trained hyperparameter. The other hyperparameter I is called the lengthscale of the covariance function. 
The greater I, the greater the covariance between at and Uf+a for a e M. We train hyperparameters by 



maximizing the Laplace approximate log marginal likelihood ( 18 ) 



We then sample from p{f*\y) using the Laplace approximation ( |19[ ). We also do this using MCMC (20) 
with J = 10000, after discarding a previous 10000 samples of p{f\y) as burn-in. We pass these samples 
of through g and g^ to draw from p(cr*|y) and p{a1\y), and compute the sample mean and variance 
of cr*|y. We use the sample mean as a point predictor, and the sample variance for error bounds on these 
predictions, and we use 10000 samples to compute these quantities. For GCPV we use Laplace and MCMC 
for inference, but for GP-EXP we only use Laplace. We compare predictions to GARCII(1,1), which has 
been shown in extensive and recent reviews to be competitive with other GARCH variants, and more 
sophisticated models [5l[6l[7]. We use the Matlab Econometrics Toolbox implementation of GARCH. 

We make forecasts of volatility, and we predict historical volatility. By 'historical volatility' we mean the 
volatility at observed time points, or between these points. Uncovering historical volatility is important. 
It could, for instance, be used to study what causes fluctuations in the stock market, or to understand 
physical systems. 

To evaluate our model, we use the Mean Squared Error (MSE) between the true variance, or proxy for 
the truth, and the predicted variance. Although likelihood has advantages, we are limited in space, and 
we wish to harmonize with the econometrics literature, and other assessments of volatility models, where 
MSE is the standard. In a similar assessment of volatility models, Brownlees et al. [7] found that MSE 
and quasi-likelihood rankings were comparable. 
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When the true variance is unknown we follow Brownlees et al. [7j and use squared observations as a 
proxy for the truth, to compare our model to GARChQ The more observations, the more reliable these 
performance estimates will be. However, not many observations (e.g. 100) are needed for a stable ranking 
of competing models; in Brownlees et al. the rankings derived from high frequency squared observations 
are similar to those derived using daily squared observations. 

6.1 Simulations 

We simulate observations from A/'(0, cr^(t)), using a{t) = sin(t) cos(i^) + 1, at t = (0,0.02,0.04, . . . ,4)^. 
We call this data set TRIG. We also simulate using a standard deviation that jumps from 0.1 to 7 and back, 
at times t = (0, 0.1, 0.2, . . . , 6)^. We call this data set JUMP. To forecast, we use all observations up until 
the current time point, and make 1, 7, and 30 step ahead predictions. So, for example, in TRIG we start by 
observing t — 0, and make forecasts a,t t = 0.02, 0.14, 0.60. Then we observe t = 0, 0.02 and make forecasts 
at t = 0.04,0.16,0.62, and so on, until all data points have been observed. For historical volatility, we 
predict the latent at at the observation times, which is safe since we are comparing to the true volatility, 
which is not used in training; the results are similar if we interpolate. Figure 1 panels a) and b) show 
the true volatility for TRIG and JUMP respectively, alongside GCPV Laplace, GCPV MCMC, GP-EXP 
Laplace, and GARCH(1,1) predictions of historical volatility. Table 1 shows the results for forecasting and 
historical volatility. 

In panel a) we see that GCPV more accurately captures the dependencies between a at different times 
points than GARCH: if we manually decrease the lengthscale in the GCPV covariance function, we can 
replicate the erratic GARCH behaviour, which inaccurately suggests that the covariance between at and 
Ut+a decreases quickly with increases in a. We also see that GCPV with an unparametrized exponential 
warping function tends to overestimates peaks and underestimate troughs. In panel b), the volatility 
is extremely difficult to reconstruct or forecast ~ with no warning it will immediately and dramatically 
increase or decrease. This behaviour is not suited to a smooth squared exponential covariance function. 
Nevertheless, GCPV outperforms GARCH, especially in regions of low volatility. We also see this in panel 
a) for t £ (1.5,2). GARCH is known to respond slowly to large returns, and to overpredict volatility [21 . 
In JUMP, the greater the peaks, and the smaller the troughs, the more GARCH suffers, while GCPV is 
mostly robust to these changes. 

6.2 Financial Data 

The returns on the daily exchange rate between the Deutschmark (DM) and the Great Britain Pound 
(GBP) from 1984 to 1992 have become a benchmark for assessing the performance of GARCH models 
[Slinilin]- This exchange data, which we refer to as DMGBP, can be obtained from www.datastream. com, 
and the returns are calculated as rt = log{Pt+i / Pt) , where Pt is the number of DM to GBP on day t. The 
returns are assumed to have a zero mean function. 

We use a rolling window of the previous 120 days of returns to make 1, 7, and 30 day ahead volatility 
forecasts, starting at the beginning of January 1988, and ending at the beginning of January 1992 (659 
trading days). Every 7 days, we retrain the parameters of GCPV and GARCH. Every time we retrain 
parameters, we predict historical volatility over the past 120 days. The average MSE for these historical 
predictions is given in Table 1, although they should be observed with caution; unlike with the simulations, 
the DMGBP historical predictions are trained using the same data they are assessed on. In Figure Ic), we 
see that the GARCH one day ahead forecasts are lifted above the GCPV forecasts, but unlike in the 
simulations, they are now operating on a similar lengthscale. This suggests that GARCH could still be 
overpredicting volatility, but that GCPV has adapted its estimation of how at and at+a correlate with 
one another. Since GARCH is suited to this financial data set, it is reassuring that GCPV predictions 

^Since each observation y is assumed to have zero mean and variance cr^, E[j/^] = a^. 
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Table 1: MSE for predicting volatility. 



Data set 


Model 


Historical 


1 step 


7 step 


30 step 


TRIG 


GCPV (LA) 


0.953 X 10~^ 


0.588 X 10" 


0.951 X 10° 


1.71 X 10" 




GCPV (MCMC) 


0.760 X IQ-^ 


0.622 X 10" 


0.979 X 10" 


1.76 X 10° 




GP-EXP 


1.93 X 10"^ 


0.646 X 10" 


1.36 X 10" 


1.15 X 10° 




GARCH 


9.38 X IQ-^ 


1.04 X 10" 


1.79 X 10" 


5.12 X 10° 


JUMP 


GCPV (LA) 


0.588 X 10^ 


0.891 X 10^ 


1.38 X 10^ 


0.135 X 10* 




GCPV (MCMC) 


1.21 X 10^ 


0.951 X 10^ 


1.37 X 10^ 


0.135 X 10* 




GP-EXP 


1.43 X 10^ 


1.76 X 10^ 


6.95 X 10^ 


1.47 X 10* 




GARCH 


1.88 X 10^ 


1.58 X 10^ 


3.43 X 10^ 


0.565 X 10* 


DMGBP 


GCPV (LA) 


2.43 X 10"" 


3.00 X 10"" 


3.08 X 10"" 


3.17 X 10"" 




GCPV (MCMC) 


2.39 X 10"" 


3.00 X 10"" 


3.08 X 10"" 


3.17 X 10"" 




GP-EXP 


2.52 X 10"" 


3.20 X 10"" 


3.46 X 10~" 


5.14 X 10"" 




GARCH 


2.83 X 10"" 


3.03 X 10"" 


3.12 X 10"" 


3.32 X 10"" 



have a similar time varying structure. Overall, GCPV and GARCH are competitive with one another 
for forecasting currency exchange returns, as seen in Table 1. Moreover, a learned warping function 
g outperforms an unparametrized one, and a full Laplace solution is comparable to using MCMC for 
inference, in accuracy and speed. This is also true for the simulations. Therefore we recommend whichever 
is more convenient to implement. 

7 Discussion 

We defined a copula process, and as an example, developed a stochastic volatility model, GCPV, which can 
outperform GARCH. With GCPV, the volatility at is distributed as a Gaussian Copula Process, which 
separates the modelling of the dependencies between volatilities at different times from their marginal 
distributions - arguably the most useful property of a copula. Further, GCPV fits the marginals in the 
Gaussian copula process by learning a warping function. If we had simply chosen an unparametrized 
exponential warping function, we would incorrectly be assuming that the log volatilities are marginally 
Gaussian distributed. Indeed, for the DMGBP data, we trained the warping function g over a 120 day period, 
and mapped its inverse through the univariate standard Gaussian cdf and differenced, to estimate the 
marginal probability density function (pdf) of at over this period. The learned marginal pdf, shown in 
Figure Id), is similar to a Gamma(4. 15,0.00045) distribution. However, in using a rolling window to retrain 
the parameters of g, we do not assume that the marginals of at are stationary; we have a time changing 
warping function. 

While GARCH is successful, and its simplicity is attractive, our model is also simple and has a number 
of advantages. We can effortlessly handle missing data, we can easily incorporate covariates other than 
time (like interest rates) in our covariance function, and we can choose from a rich class of covariance 
functions - squared exponential, Brownian motion, Matern, periodic, etc. In fact, the volatility of high 
frequency intradaily returns on equity indices and currency exchanges is cyclical [22], and GCPV with a 
periodic covariance function is uniquely well suited to this data. And the parameters of GCPV, like the 
covariance function lengthscale, or the learned warping function, provide insight into the underlying source 
of volatility, unlike the parameters of GARCH. 

Finally, copulas are rapidly becoming popular in applications, but often only bivariate copulas are being 
used. We introduced a copula process where one can learn the dependency structure between arbitrarily 
many random variables independently of their marginal distributions. We hope the Gaussian Copula 
Process Volatility model will encourage other applications of copula processes. More generally, we hope 
our work will help bring together the machine learning and econometrics communities. 
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1 2 3 4 2 4 6 200 400 600 0.005 0.01 
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(a) (b) (c) (d) 



Figure 1: Predicting volatility and learning its marginal pdf. For a) and b), the true volatility, and GCPV 
(MCMC), GCPV (LA), GP-EXP, and GARCH predictions, are shown respectively by a thick green hne, a dashed 
thick blue line, a dashed black line, a cyan line, and a red line, a) shows predictions of historical volatility for TRIG, 
where the shade is a 95% confidence interval about GCPV (MCMC) predictions, b) shows predictions of historical 
volatility for JUMP. In c), a black line and a dashed red line respectively show GCPV (LA) and GARCH one day 
ahead volatility forecasts for DMGBP. In d), a black line and a dashed blue line respectively show the GCPV learned 
marginal pdf of at in DMGBP and a Gamma(4. 15,0.00045) pdf. 
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